# Load necessary libraries
library(tidyverse) # for data manipulation and visualization
library(viridis) # for color scales
library(ggpp) # for position_dodgenudge
library(ggrepel) # for text repelling
library(directlabels) # for categorical text labeling
library(ggforce) # for enclosing shapes
library(janitor) # for cleaning
library(ggplot2) # for plotting
library(plotly) # for interactive plots
library(tools) #
library(ggiraph)
library(RColorBrewer) # color palettes specifically tailored for data visualization
library(crosstalk)
library(htmltools)
library(dplyr)AAI1001 Team 7 Project Proposal
1 Project Proposal
Please ensure that you have installed these packages in your R environment before running the code chunks below.
2 Introduction
Using various packages in R, we will create a poster that thoughtfully displays the socioeconomic factors that influence fertility/birth rates in Singapore. To do so we will be using fertility rate data sourced from SingStat as well as labour participation and marital status data from data.gov.sg. To note that data for 1995, 2000 and 2005 are not available as the Comprehensive Labour Force Survey was not conducted in these years due to the conduct of the Population Census 2000, General Household Surveys 1995 and 2005 by the Singapore Department of Statistics.
2.1 Analysis of Original Visualisations
The original visualisations from the Straits Times are shown below,
Visualisations adapted from Straits Times: Singapore’s total fertility rate hits record low in 2023, falls below 1 for first time
The original visualisations focused on two variables: Singapore’s total fertility rate (quantitative) and years (quantitative) from 2019 to 2023. While these visualisations provide a clear overview of the declining fertility trend, they lack depth in exploring the underlying socioeconomic factors that contribute to this trend.
In a more recent article, Straits Times: Why the fertility rate doesn’t capture socio-economic or cultural trends, it critiques the visualizations represented, stating that the first article does not aptly consider socioeconmic and cultural factors which result in increasing singlehood among both men and women which has consequently led to a decline in marital fertility rates.
Key Weaknesses:
As per the response article, socioeconomic and cultural factors that influenced rising singlehood in Singapore were not considered.
Only Total Fertility Rate was used in the original article, this does not give insight into the fertility rates of different age groups
The graph presented in the article only displays 2019 to 2023, which does not provide a comprehensive view of the fertility trends over a longer period.
Strengths:
Clean and clear formatting.
Good headline-grabbing summary.
Uses official government data.
2.2 Proposed Improvements
These are the corresponding improvements that our group has identified:
Female Labour Participation & Marital Status: To show the difference in fertility trends between women in and out of the workforce, and by marital status.
Age band: To compare fertility rates across different age groups, allowing for a more nuanced understanding of how age impacts fertility.
Interactive time period selection: Using data from 1990 to 2022, we will create an interactive visualisation that allows users to select different time periods/intervals, as well as different age bands and marital statuses to explore how these factors influence fertility rates over time.
The final visualisation that we plan to create will have 5 variables:
Year: time series x-axis variable
Count (in thousands): quantitative y-axis variable that accommodates the unit of measurement from our datasets. Count will represent Births Per Thousand Females and population of Females in the labour force (in thousands)
marital status: categorical variable that will be differentiated by colour and appear as geom_col. There will be 3 marital statuses (single, married, widowed/divorced) that will be represented as separate bars in the visualisation
age band: categorical variable that will be differentiated by colour and appear as geom_line. There will be 7 age bands (15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49) and total fertility rate that will be represented as separate lines in the visualisation
labour participation: categorical variable that is switchable to show separate faceted graphs
We will also consider the additions of other variables such as yearly housing prices or other factors that may affect fertility rate.
Thus, a low fidelity mockup of the final visualisation is shown below:
3 Planned work distribution for each team member
| Team Member | Tasks |
|---|---|
| Guo Zi Qiang Robin | Data cleaning & data engineering for proposal submission |
| Chew Tze Han | pls ownself insert what u did |
| Cheong Wai Hong Jared | pls ownself insert what u did |
| Akram | pls ownself insert what u did |
| Gregory Tan | pls ownself insert what u did |
4 Data Engineering
The datasets will be loaded as 3 tibbles:
| Dataset | Description |
|---|---|
| fertility | Includes fertility rate data; contains metadata which needs to be skipped |
| work | Includes data on males and females in the labour force and marital status |
| not_working | Includes data on males and females out of the labour force and marital status |
4.1 Data Loading
fertility <- read_csv(
"datasets/ResidentFertilityRate.csv",
skip = 9,
n_max = 17,
show_col_types = FALSE
)
work <- read_csv("datasets/ResidentLabourForceAged15YearsandOverbyMaritalStatusAgeandSex.csv", show_col_types = FALSE)
not_working <- read_csv("datasets/ResidentsOutsidetheLabourForceAged15YearsandOverbyMaritalStatusAgeandSex.csv", show_col_types = FALSE)4.2 Cleaning
The tibbles “fertility”, “work”, and “not_working” will be cleaned and engineered for our interactive visualisation of fertility rates.
4.2.1 fertility
The following steps will be taken to clean and reshape “fertility”:
“fertility” tibble contains “na” strings which are not actually NA values, these points will need to be converted to NA values
fertility rate data from SingStat is in wide format with years as the columns, we will pivot long for year-wise plots
fertility rate data goes up till 2024, whereas the labour force data only goes up till 2022, we will filter the fertility rate data to only include years after 1990 and up till 2022
standardise age banding of fertility rate dataset to be consistent with labour force data. For example, “15-19” instead of “15 - 19 Years (Per Thousand Females)’
filtered to include age specific fertility rates and the total fertility rate by year
introduce Unit of Measurement (uom) column to indicate scaling for total fertility rates and age banded fertility rates
# Clean and reshape fertility data
fertility_clean <- fertility |>
clean_names() |>
rename(measure = data_series) |>
# Convert all columns to character to handle mixed types
mutate(across(-measure, as.character)) |>
pivot_longer(
cols = -measure,
names_to = "year",
values_to = "value"
) |>
mutate(
year = as.numeric(str_remove(year, "^x")), # Remove leading "x"
measure = str_trim(measure),
value = ifelse(tolower(value) == "na", NA, value), # Handle "na" strings
value = as.numeric(value)
) |>
# Extract age bands and filter for only age-specific rates
mutate(
age_band = case_when(
measure == "Total Fertility Rate (TFR) (Per Female)" ~ "All",
str_detect(measure, "15 - 19") ~ "15-19",
str_detect(measure, "20 - 24") ~ "20-24",
str_detect(measure, "25 - 29") ~ "25-29",
str_detect(measure, "30 - 34") ~ "30-34",
str_detect(measure, "35 - 39") ~ "35-39",
str_detect(measure, "40 - 44") ~ "40-44",
str_detect(measure, "45 - 49") ~ "45-49",
TRUE ~ NA_character_
)
) |>
filter(!is.na(age_band)) |> # Keep only age band rows
mutate(
uom = case_when(
age_band == "All" ~ "per female",
TRUE ~ "per thousand females")
) |>
filter(year > 1990 & year <= 2020) |>
select(year, age_band, fertility_rate = value, uom)4.2.2 not_working
The following steps will be taken to clean and reshape “not_working”:
standardise column names to the 7 (15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49) age bands to be consistent with fertility and remove extra bandings
for labour datasets, divide labour_force values by 1000 to align with count (in thousands) y-axis variable
some outside_labour_force values are “-” which are not valid numerics, convert these to NA
rename age column to age_band to match fertility
aggregate age bands to introduce All to represent population outside labour force by year and marital status only, this is so that we can introduce interactivity with total fertility rate and fertility rates across age bands
# Clean and reshape not_working data
not_working_clean <- not_working |>
clean_names() |>
# Filter for the 7 age bands only
filter(age %in% c("15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49")) |>
# Replace "-" with NA and convert to numeric
mutate(
outside_labour_force = na_if(outside_labour_force, "-"),
outside_labour_force = as.numeric(outside_labour_force),
outside_labour_force = outside_labour_force / 1000, # Convert to thousands
age_band = age # Rename for consistency with fertility_clean
) |>
select(year, sex, marital_status, age_band, outside_labour_force)
# Create aggregated "All" row by year, sex, and marital_status
not_working_all <- not_working_clean |>
group_by(year, sex, marital_status) |>
summarise(
age_band = "All",
outside_labour_force = sum(outside_labour_force, na.rm = TRUE),
.groups = "drop"
)
# Combine original cleaned data with the aggregated "All" row
not_working_clean <- bind_rows(not_working_clean, not_working_all)4.2.3 work
“work” tibble is cleaned in a similar way to “not_working”.
# Clean and reshape work data
work_clean <- work |>
clean_names() |>
# Filter to only the 7 standard age bands
filter(age %in% c("15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49")) |>
# Replace "-" with NA and convert labour_force to numeric
mutate(
labour_force = na_if(labour_force, "-"),
labour_force = as.numeric(labour_force),
labour_force = labour_force / 1000, # Convert to thousands
age_band = age # Rename to match other datasets
) |>
select(year, sex, marital_status, age_band, labour_force)
# Create aggregated "All" row by year, sex, and marital_status
work_all <- work_clean |>
group_by(year, sex, marital_status) |>
summarise(
age_band = "All",
labour_force = sum(labour_force, na.rm = TRUE),
.groups = "drop"
)
# Combine original cleaned data with the aggregated "All" row
work_clean <- bind_rows(work_clean, work_all)4.3 Outliers
Using the IQR method to identify outliers in the datasets, we will plot a time series line graph to visualise the points where the outliers occur.
# IQR-based outlier detection function
detect_outliers_iqr <- function(df, value_col, group_cols) {
df |>
group_by(across(all_of(group_cols))) |>
mutate(
Q1 = quantile(.data[[value_col]], 0.25, na.rm = TRUE),
Q3 = quantile(.data[[value_col]], 0.75, na.rm = TRUE),
IQR = Q3 - Q1,
lower_bound = Q1 - 1.5 * IQR,
upper_bound = Q3 + 1.5 * IQR,
is_outlier = .data[[value_col]] < lower_bound | .data[[value_col]] > upper_bound
) |>
ungroup()
}
# Apply to each dataset and exclude All (which is all ages)
fertility_outliers_flagged <- fertility_clean |>
filter(age_band != "All") |>
detect_outliers_iqr("fertility_rate", "age_band")
work_outliers_flagged <- work_clean |>
filter(age_band != "All") |>
detect_outliers_iqr("labour_force", c("age_band", "marital_status", "sex"))
not_working_outliers_flagged <- not_working_clean |>
filter(age_band != "All") |>
detect_outliers_iqr("outside_labour_force", c("age_band", "marital_status", "sex"))# Fertility rate outliers (excluding Total Fertility Rate)
ggplot(
filter(fertility_outliers_flagged, age_band != "All"),
aes(x = year, y = fertility_rate, color = age_band, group = age_band)
) +
geom_line(linewidth = 0.8) +
geom_point(
data = filter(fertility_outliers_flagged, is_outlier & age_band != "All"),
color = "red", size = 2, shape = 21, fill = "white"
) +
facet_wrap(~age_band, scales = "free_y") +
labs(
title = "Fertility Rates by Age Band with Outliers (Excluding Total Fertility Rate)",
y = "Fertility Rate", x = "Year", color = "Age Band"
) +
theme_minimal()# Labour force (working)
ggplot(work_outliers_flagged, aes(x = year, y = labour_force, color = age_band, group = interaction(age_band, marital_status))) +
geom_line(linewidth = 0.8) +
geom_point(
data = filter(work_outliers_flagged, is_outlier),
shape = 21, fill = "white", color = "red", size = 2,
position = position_jitter(width = 0.5, height = 0.2)
) +
facet_grid(sex ~ marital_status, scales = "free_y") +
labs(
title = "Labour Force by Age Band, Marital Status and Sex",
y = "Labour Force (in thousands)", x = "Year", color = "Age Band"
) +
theme_minimal(base_size = 12) +
theme(
strip.text = element_text(size = 11),
axis.text.x = element_text(angle = 45, hjust = 1)
)Warning: Removed 56 rows containing missing values or values outside the scale range
(`geom_line()`).
# Outside labour force (not working)
ggplot(not_working_outliers_flagged, aes(x = year, y = outside_labour_force, color = age_band, group = interaction(age_band, marital_status))) +
geom_line(linewidth = 0.8) +
geom_point(
data = filter(not_working_outliers_flagged, is_outlier),
shape = 21, fill = "white", color = "red", size = 2,
position = position_jitter(width = 0.5, height = 0.2)
) +
facet_grid(sex ~ marital_status, scales = "free_y") +
labs(
title = "Outside Labour Force by Age Band, Marital Status and Sex",
y = "Outside Labour Force (in thousands)", x = "Year", color = "Age Band"
) +
theme_minimal(base_size = 12) +
theme(
strip.text = element_text(size = 11),
axis.text.x = element_text(angle = 45, hjust = 1)
)Warning: Removed 87 rows containing missing values or values outside the scale range
(`geom_line()`).
The graphs generally show consistency with reported fertility rates and labour force participation trends in Singapore.
5 Joining datasets
We will join the datasets together to create a single tibble that contains all the necessary information for our visualisation. The joined tibble will contain the following columns:
year: from 1990 to 2022age_band: Age bands and “All” which is for total fertility ratemarital_status: Marital status of the data pointfertility_rate: Fertility rate by age band (per thousand females) and total fertility rate (per female)uom: Fertility rate unit of measurementlabour_status: Labour status of the data point, either “labour_force” or “outside_labour_force”count: Number of females either in workforce or outside workforce (in thousands)
# Filter labour data to only include females
work_clean_female <- work_clean |>
filter(sex == "female") |>
select(-sex)
not_working_clean_female <- not_working_clean |>
filter(sex == "female") |>
select(-sex)A full_join() is used to combine both work_clean_female and not_working_clean_female tibbles, ensuring that all rows from both tibbles are included to combine the labour force columns. The join is done on the year, marital_status, and age_band columns, common dimensions to both tibbles to prevent any data loss.
# Combine female labour and not working into one tibble
labour_status_female <- full_join(
work_clean_female,
not_working_clean_female,
by = c("year", "marital_status", "age_band")
)A left_join() is used joining the fertility_clean tibble to the labour_status_female tibble, ensuring that all rows from fertility_clean are included. This will allow us to combine and be able to associate fertility rates with labour force participation data.
# Join to fertility
fertility_labour_joined_female <- fertility_clean |>
left_join(labour_status_female, by = c("year", "age_band"))Conversion of labour_force and outside_labour_force columns to have a single column dictating labour status. years that do not have corresponding labour force data (1995, 2000, 2005) are filtered out.
# Pivot longer and ensure no sex columns remain
final <- fertility_labour_joined_female |>
pivot_longer(
cols = c("labour_force", "outside_labour_force"),
names_to = "labour_status",
values_to = "count"
) |>
group_by(year) |>
filter(!all(is.na(count))) |> # to deal with missing years in labour force data
ungroup() |>
mutate(count = replace_na(count, 0))
glimpse(final)Rows: 1,296
Columns: 7
$ year <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2019, 2019, 2019, 2…
$ age_band <chr> "All", "All", "All", "All", "All", "All", "All", "All",…
$ fertility_rate <dbl> 1.10, 1.10, 1.10, 1.10, 1.10, 1.10, 1.14, 1.14, 1.14, 1…
$ uom <chr> "per female", "per female", "per female", "per female",…
$ marital_status <chr> "married", "married", "single", "single", "widowed_divo…
$ labour_status <chr> "labour_force", "outside_labour_force", "labour_force",…
$ count <dbl> 404.2, 100.2, 284.1, 158.6, 37.9, 3.2, 416.5, 106.2, 27…
summary(final) year age_band fertility_rate uom
Min. :1991 Length:1296 Min. : 0.10 Length:1296
1st Qu.:1998 Class :character 1st Qu.: 2.16 Class :character
Median :2007 Mode :character Median : 11.30 Mode :character
Mean :2006 Mean : 34.01
3rd Qu.:2014 3rd Qu.: 53.77
Max. :2020 Max. :130.00
marital_status labour_status count
Length:1296 Length:1296 Min. : 0.00
Class :character Class :character 1st Qu.: 1.70
Mode :character Mode :character Median : 10.95
Mean : 39.62
3rd Qu.: 49.70
Max. :416.90
sum(is.na(final$year))[1] 0
sum(is.na(final$age_band))[1] 0
sum(is.na(final$fertility_rate))[1] 0
sum(is.na(final$uom))[1] 0
sum(is.na(final$marital_status))[1] 0
sum(is.na(final$labour_status))[1] 0
sum(is.na(final$count))[1] 0
unique(final$marital_status)[1] "married" "single" "widowed_divorced"
unique(final$age_band)[1] "All" "15-19" "20-24" "25-29" "30-34" "35-39" "40-44" "45-49"
unique(final$labour_status)[1] "labour_force" "outside_labour_force"
6 Data Visualisation
final_cleaned <- final %>%
mutate(
labour_status_label = tools::toTitleCase(gsub("_", " ", labour_status))
)
# aggregate counts so there is only one value per year × marital_status × labour_status
agg_counts <- final_cleaned %>%
group_by(year, marital_status, labour_status_label) %>%
summarise(count = sum(count, na.rm = TRUE), .groups = "drop")
# SharedData on the aggregated counts (so the dropdown filters the bars)
shared_counts <- SharedData$new(agg_counts, group = "labour_status_selection")
## 2. Filter widget
filter_widget <- filter_select(
id = "labour_status_filter",
label = "Select Labour Status:",
sharedData = shared_counts,
group = ~labour_status_label
)
## 3. Build the plot
# compute scaling factor
max_count <- max(agg_counts$count, na.rm = TRUE)
max_fertility <- max(final_cleaned$fertility_rate, na.rm = TRUE)
scale_factor <- if (max_fertility>0) (max_count * 0.5)/max_fertility else 1
p <- ggplot() +
# Bars: use the aggregated, shared data
geom_col(
data = shared_counts,
aes(x = year, y = count, fill = marital_status, text = paste("Count:", count)),
position = position_dodge(preserve = "single")
) +
# Lines: use the original, detailed fertility data
geom_line(
data = final_cleaned %>% filter(age_band != "All"),
aes(x = year, y = fertility_rate * scale_factor, color = age_band, group = age_band),
linewidth = 0.5
) +
scale_y_continuous(
name = "Labour Count (bars)",
sec.axis = sec_axis(~ . / scale_factor, name = "Fertility Rate (lines)")
) +
scale_fill_brewer(
palette = "Pastel1",
name = "Marital Status:",
labels = ~tools::toTitleCase(gsub("_", " ", .x))
) +
scale_color_brewer(
palette = "Set2",
name = "Age Band (fertility):"
) +
labs(
title = "Fertility Rate and Labour Status Over Time",
subtitle = "Use the dropdown to switch between labour‑force vs outside‑labour‑force",
x = "Year"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 16),
axis.title = element_text(face = "bold")
)Warning in geom_col(data = shared_counts, aes(x = year, y = count, fill =
marital_status, : Ignoring unknown aesthetics: text
## 4. Convert to Plotly and arrange with the widget
interactive_plot <- ggplotly(p, tooltip = c("x","fill","text")) %>%
layout(
legend = list(orientation="h", xanchor="center", x=0.5, y=-0.2),
margin = list(b=100)
)
tagList(
filter_widget,
interactive_plot
)7 Data Visualisation Ver 2
library(dplyr)
library(ggplot2)
library(crosstalk)
library(plotly)
# 1. Clean & title-case labour status
final_cleaned <- final %>%
mutate(
labour_status_label = tools::toTitleCase(gsub("_", " ", labour_status))
)
# 2. Aggregate counts by year × labour status × marital status
agg_counts <- final_cleaned %>%
group_by(year, labour_status_label, marital_status) %>%
summarise(count = sum(count, na.rm = TRUE), .groups = "drop")
# 3. SharedData for the bars (so filter_select only affects labour status)
shared_counts <- SharedData$new(agg_counts, group = "labour_status_selection")
# 4. Labour-status filter widget
filter_widget <- filter_select(
id = "labour_status_filter",
label = "Select Labour Status:",
sharedData = shared_counts,
group = ~labour_status_label
)
# 5. Compute secondary-axis scale factor
max_count <- max(agg_counts$count, na.rm = TRUE)
max_fertility <- max(final_cleaned$fertility_rate, na.rm = TRUE)
scale_factor <- if (max_fertility > 0) (max_count * 0.5) / max_fertility else 1
# 6. Build ggplot with year in hover text
p_final <- ggplot() +
# Bars: dodged by marital status, filtered by labour status
geom_col(
data = shared_counts,
aes(
x = year,
y = count,
fill = marital_status,
text = paste0(
"Year: ", year,
"<br>Labour status: ", labour_status_label,
"<br>Marital status: ", tools::toTitleCase(marital_status),
"<br>Count: ", count
)
),
position = position_dodge(width = 0.8, preserve = "single"),
width = 0.7
) +
# Lines: fertility rate over time, with year in hover
geom_line(
data = final_cleaned %>% filter(age_band != "All"),
aes(
x = year,
y = fertility_rate * scale_factor,
color = age_band,
group = age_band,
text = paste0(
"Year: ", year,
"<br>Age band: ", age_band,
"<br>Fertility rate: ", round(fertility_rate, 2)
)
),
linewidth = 0.8
) +
# Dual y-axis
scale_y_continuous(
name = "Labour Count (bars)",
sec.axis = sec_axis(~ . / scale_factor, name = "Fertility Rate (lines)")
) +
# Colour palette for marital status
scale_fill_brewer(
palette = "Pastel1",
name = "Marital Status",
labels = ~tools::toTitleCase(gsub("_", " ", .x))
) +
# Colour palette for age bands
scale_color_brewer(
palette = "Set2",
name = "Age Band"
) +
labs(
title = "Fertility Rate and Labour Status × Marital Status Over Time",
subtitle = "Filter by in- vs outside-labour force; bars show marital status breakdown",
x = "Year"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 16),
axis.title = element_text(face = "bold"),
legend.position = "bottom",
legend.box = "vertical",
legend.title = element_text(face = "bold")
)Warning in geom_col(data = shared_counts, aes(x = year, y = count, fill =
marital_status, : Ignoring unknown aesthetics: text
Warning in geom_line(data = final_cleaned %>% filter(age_band != "All"), :
Ignoring unknown aesthetics: text
# 7. Convert to Plotly with enhanced hover and legend underneath
interactive_plot <- ggplotly(
p_final,
tooltip = "text"
) %>%
layout(
legend = list(
orientation = "h",
x = 0.5,
xanchor = "center",
y = -0.3
),
margin = list(b = 120)
)
# 8. Render widget + plot
tagList(
filter_widget,
interactive_plot
)8 References
Data.gov.sg. (n.d.-a). data.gov.sg. https://staging.data.gov.sg/datasets?query=household&page=1&searchColumns=Year&resultId=d_e19478b30d8f5cd6a1dc482bf2e46eb7
Data.gov.sg. (n.d.-b). data.gov.sg. https://staging.data.gov.sg/datasets?query=household&page=1&searchColumns=Year&resultId=d_e2475676af29ec78749f1b22cf8b301c
MacroTrends. (n.d.-a). Singapore unemployment rate. MacroTrends. Retrieved July 5, 2025, from https://www.macrotrends.net/global-metrics/countries/sgp/singapore/unemployment-rate
MacroTrends. (n.d.-b). Singapore population. MacroTrends. Retrieved July 5, 2025, from https://www.macrotrends.net/global-metrics/countries/sgp/singapore/population
Singapore Department of Statistics. (n.d.). Population by age group, sex and type of locality, 2023 [Table M810091]. Singapore Department of Statistics. Retrieved July 5, 2025, from https://tablebuilder.singstat.gov.sg/table/TS/M810091
Tan, T. (2024a, March 11). Singapore’s total fertility rate hits record low in 2023, falls below 1 for first time. The Straits Times. https://www.straitstimes.com/singapore/politics/singapore-s-total-fertility-rate-hits-record-low-in-2023-falls-below-1-for-first-time
Tan, T. (2024b, June 30). Why the fertility rate doesn’t capture socio-economic or cultural trends. The Straits Times. https://www.straitstimes.com/singapore/why-the-fertility-rate-doesn-t-capture-socio-economic-or-cultural-trends